Abstract

Deep Neural Networks now excel at image classification, detection and segmentation. 
When used to scan images by means of a sliding window, however, their high computational 
complexity can bring even the most powerful hardware to its knees. We show how dynamic 
programming can speed up the process by orders of magnitude, even when max-pooling layers
are present. 

1 Introduction 

Deep Max-Pooling Convolutional Neural Networks are Deep Neural Networks (DNN) with
convolutional and max-pooling layers. Convolutional Neural Networks (CNN) can be traced back to
the Neocognitron [1] in 1980. They were first successfully applied to relatively small tasks such
as digit recognition f7, image interpretation [3] and object recognition [4]. Back then their size
was greatly limited by the low computational power of available hardware. Since 2010, however,
DNN have greatly profited from Graphics Processing Units (GPU). Simple GPU-based multilayer
perceptrons (MLP) established new state-of-the-art results [5] on the MNIST handwritten digit
dataset [4] when made both deep and large (augmenting the training set by artificial samples
helped to avoid overfitting). 2011 saw the first implementation [3] of GPU-based DNN on the
CUDA parallel computing platform [J. This yielded new benchmark records on multiple object
detection tasks. The field of Deep Learning with Neural Networks exploded. Multi-Column DNN
[8] improved previous results by over 30% on many benchmarks, including handwritten digits
(MNIST) and Latin letters (NIST SD 19) 0; Chinese characters (CASIA) [10]; traffic signs
(GTSRB) [ilj; and natural images (CIFAR 10) [11]. Another flavor of DNN [13] greatly improved the
accuracy on a subset of ImageNet [H]. Recently, Google parallelized a large DNN on a cluster
with over 10000 CPU cores [IF.

For image classification, the DNN returns a vector of class posterior probabilities when provided
with an input patch whose fixed width and height usually do not exceed a few hundred pixels
and depend on the network architecture. But DNN also excel at image segmentation and object
detection [TB]. For segmentation, image data within a square patch of odd size is used to determine
the class of its central pixel. The network is trained on patches extracted from a set of images with 
ground truth segmentations (i.e. the class of each pixel is known). To segment an unseen image, 
the trained net is used to classify all of its pixels. Object detection within an image is trivially 
cast as a segmentation problem: pixels close to the centroid of each object are classified differently 
from background pixels. Once an unseen image is segmented, the centroid of each detected object 
is determined using simple image processing techniques. 

Solving segmentation and detection tasks requires applying the network to every patch
contained in the image, which is prohibitively expensive when implemented in the naive,
straightforward way. Consider a net with a convolutional layer immediately above the input layer: when
evaluating the first patch contained in the input image, the patch is convolved with a large number
of kernels to compute the output maps; when evaluating the next (typically overlapping) patch,
such convolutions are re-evaluated, which is a huge amount of redundant computation. It is better to
compute each convolution only once for the whole input image: the resulting set of images (which
we will refer to as extended maps) contains the maps for each patch contained in the input image.

In the particular case of a CNN without max-pooling layers, this optimization is trivially 
implemented by computing all convolutions in the first layer on the entire input image, then 
computing all convolutions in subsequent layers on the resulting extended maps. This approach
[TT] yielded real-time detection performance [18] when combined with dedicated FPGA or even
ASIC integrated circuits.

However, present DNN owe much of their power to max-pooling layers interleaved with con- 
volutional layers. Max-pooling cannot be handled using the straightforward approach outlined 
above. For example, when we perform a 2 x 2 max-pooling operation on an extended map, we 
obtain a smaller extended map which does not contain information from all the patches contained 
in the input image; instead, only patches whose upper left corner lies at even coordinates of the 
original image are represented. Any subsequent max-pooling layer would further aggravate the 
problem. 

Our contribution consists in an optimized forward-propagation approach which avoids such 
problems by fragmenting the extended maps resulting from each max-pooling layer, such that 
each fragment contains information independent of other fragments, and the union of all frag- 
ments contains information about all the patches in the input image. A similar approach was 
previously used for handling a single subsampling layer in a simple CNN for face detection. 
Our mechanism, however, is completely general. It handles arbitrary architectures mixing con- 
volutional and max-pooling layers in any order, and ensures that no redundant computation is 
performed at any stage. 

2 Method 

We consider nets composed of four types of layers [S]: input, convolutional, max-pooling and
fully-connected. In the following, different layers are denoted by index $l$. Nets are formed by
an input layer ($l = 0$), followed by a set of convolutional and max-pooling layers $l \in \{1, \dots, L\}$,
followed by a number of fully-connected layers. The optimization described in this paper concerns
convolutional and max-pooling layers, and allows finding the outputs of layer $L$. We do not
discuss forward-propagation in fully-connected layers, where a trivial approach does not suffer
from redundant computations.

To simplify notation, we consider nets with square maps and square kernels, but this is trivially
generalized to rectangular maps and filters. Sets are denoted by bold symbols.

We first recall how convolutional and max-pooling layers are forward-propagated at the patch 
level (Section 2.1). Then we extend the approach by the proposed optimization, which operates 
at the level of the whole image (Section 2.2). Figure 1 illustrates both approaches.



2.1 Patch-level testing 

The square input patch is represented as a set $\mathbf{P}_0$ containing one or more input maps (depending
on the number of input channels). Let $w_0$ denote the width and height of such maps (i.e., the size
of the input patch). Because the input image and all kernels are assumed square, maps obtained
as the output of any intermediate layer $l$ (i.e., contained in $\mathbf{P}_l$) will be square.



2.1.1 Convolutional layers 

Let $l$ denote the index of a convolutional layer. The layer's output is a set $\mathbf{P}_l$ of square maps with
size $w_l$. $\mathbf{P}_l$ is obtained as a function of $\mathbf{P}_{l-1}$ [6]. $w_l = w_{l-1} - (k - 1)$, where $k$ is the width of
the (square) kernels of layer $l$. In general, the number of maps may change after a convolutional
layer, i.e.: $|\mathbf{P}_l| \neq |\mathbf{P}_{l-1}|$.



2.1.2 Max-pooling layers 

Let $l$ denote the index of a max-pooling layer. The layer's output is a set $\mathbf{P}_l$ of square maps with
size $w_l$. $\mathbf{P}_l$ is obtained as a function of $\mathbf{P}_{l-1}$ [6]. $w_l = w_{l-1}/k$, where $k$ is the size of the square
max-pooling kernel; the architecture of the net is such that $\mathrm{mod}(w_{l-1}, k) = 0$. The number of
maps is unchanged after a max-pooling layer, i.e. $|\mathbf{P}_l| = |\mathbf{P}_{l-1}|$.
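The two size rules above can be checked with a few lines of code. The following is a minimal sketch, not part of the original method description: the function name `patch_map_sizes` and the `(kind, k)` layer encoding are assumptions made for illustration. It applies $w_l = w_{l-1} - (k - 1)$ for convolutional layers and $w_l = w_{l-1}/k$ for max-pooling layers to the convolutional and max-pooling layers of the net in Table 1.

```python
def patch_map_sizes(w0, layers):
    """Return the map width after each layer of a patch-level forward pass.

    `layers` is a list of (kind, k) pairs, where kind is "conv" or "pool"
    and k is the (square) kernel size. This encoding is hypothetical.
    """
    sizes = [w0]
    for kind, k in layers:
        w = sizes[-1]
        if kind == "conv":
            w = w - (k - 1)      # a valid convolution shrinks the map by k - 1
        elif kind == "pool":
            if w % k != 0:       # the architecture must satisfy mod(w, k) = 0
                raise ValueError(f"map width {w} is not divisible by pooling size {k}")
            w = w // k
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        sizes.append(w)
    return sizes


# Layers 1..8 (convolutional and max-pooling) of the net in Table 1.
TABLE1_LAYERS = [("conv", 4), ("pool", 2), ("conv", 5), ("pool", 2),
                 ("conv", 4), ("pool", 2), ("conv", 4), ("pool", 2)]

print(patch_map_sizes(95, TABLE1_LAYERS))
# -> [95, 92, 46, 42, 21, 18, 9, 6, 3]
```

The resulting widths match the map sizes listed in Table 1 for a 95 x 95 input patch.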



2.2 Optimized testing on images 

Let us now consider a square input image of size $s > w_0$. We want to compute the network outputs
on all windows completely contained within it, i.e. $(s - w_0 + 1)^2$ windows (patches).

We represent the output data of a given layer $l$ of the net as a set containing $F_l$ fragments.
We denote each fragment in layer $l$ by an index $f \in \{1, \dots, F_l\}$. Each fragment $f$ is
associated with a set $\mathbf{I}_l^f$ of extended maps.

Extended maps in the same fragment all have the same size; extended maps in different fragments
may have different sizes, and not all such sizes may be square even though the input image
is. Let $s_{x,l}^f$ and $s_{y,l}^f$ denote width and height of the extended maps in set $\mathbf{I}_l^f$, respectively.

The input image is provided as a single fragment, therefore $F_0 = 1$. Such fragment contains a
set $\mathbf{I}_0^1$ of square extended maps with sizes

$$s_{x,0}^1 = s \tag{1}$$
$$s_{y,0}^1 = s \tag{2}$$

2.2.1 Convolutional layers 

Let $l$ denote the index of a convolutional layer. Its output consists of a set of fragments, such that
$F_l = F_{l-1}$.
Figure 1: Patch-based and image-based forward propagation techniques for convolutional and
max-pooling layers. 

A given fragment $f \in \{1, \dots, F_l\}$ contains a set $\mathbf{I}_l^f$ of extended maps; each has size

$$s_{x,l}^f = s_{x,l-1}^f - (k - 1) \tag{3}$$
$$s_{y,l}^f = s_{y,l-1}^f - (k - 1) \tag{4}$$

where $k$ is the size of the (square) kernels of layer $l$. Again, note that maps in different fragments
may have different sizes, but all maps in the same fragment have the same size.

$\mathbf{I}_l^f$ is obtained as a function of $\mathbf{I}_{l-1}^f$, using the same operations as with patch-level forward
propagation. However, convolutions are performed on the (large) extended maps rather than on
small maps as in the patch-level approach.
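The claim that one convolution over the whole image contains the convolved maps of every patch can be illustrated on a toy example. The sketch below is an assumption-laden illustration, not the authors' code: it uses a single input map, a single random integer kernel, plain Python lists, and hypothetical helper names (`conv2d_valid`, `crop`). It compares the valid convolution of each $w_0 \times w_0$ patch with the corresponding crop of the extended map.

```python
import random


def conv2d_valid(img, kernel):
    """Valid 2D correlation (the usual CNN 'convolution') of a list-of-rows
    image with a square kernel; the output shrinks by k - 1 in each dimension."""
    k = len(kernel)
    h, w = len(img), len(img[0])
    return [[sum(img[y + j][x + i] * kernel[j][i]
                 for j in range(k) for i in range(k))
             for x in range(w - k + 1)]
            for y in range(h - k + 1)]


def crop(img, x0, y0, size):
    """Square crop of side `size` whose top-left corner is (x0, y0)."""
    return [row[x0:x0 + size] for row in img[y0:y0 + size]]


random.seed(0)
s, w0, k = 8, 5, 3                     # image size, patch size, kernel size
image = [[random.randint(0, 9) for _ in range(s)] for _ in range(s)]
kernel = [[random.randint(-2, 2) for _ in range(k)] for _ in range(k)]

extended = conv2d_valid(image, kernel)  # extended map, side s - (k - 1)
w1 = w0 - (k - 1)                       # patch-level map size after the layer

# Every patch-level result must equal a w1 x w1 crop of the extended map.
all_match = all(
    conv2d_valid(crop(image, px, py, w0), kernel) == crop(extended, px, py, w1)
    for py in range(s - w0 + 1) for px in range(s - w0 + 1))
print(len(extended), all_match)
```

Integer data is used so that the comparison is exact; with floating-point maps the two results would agree only up to rounding.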

2.2.2 Max-pooling layers 

Let $l$ denote the index of a max-pooling layer. Its output consists of $F_l = k^2 F_{l-1}$ fragments,
where $k$ is the size of the square max-pooling kernel. In particular, each
input fragment $f_{in} \in \{1, \dots, F_{l-1}\}$ generates $k^2$ output fragments.

Consider a given input fragment $f_{in}$ associated with the set $\mathbf{I}_{l-1}^{f_{in}}$ containing the input extended
maps. Let $\mathbf{o}$ be a set of $k^2$ 2D offsets defined as the Cartesian product
$\{0, 1, \dots, k-1\} \times \{0, 1, \dots, k-1\}$. E.g., for $k = 2$:

$$\mathbf{o} = \{(0,0), (1,0), (0,1), (1,1)\}.$$

For a given input fragment $f_{in}$ and for each offset $o \in \mathbf{o}$, $o = (o_x, o_y)$, one output fragment $f$
is produced. Each of the extended maps in $f$ is generated by applying the max-pooling operation
to the corresponding extended map in $f_{in}$, starting at the top left offset $(x, y) = o$. Specifically,
the pixel at coordinates $(\tilde{x}, \tilde{y})$ in the output map is computed as the maximum of all pixels in the
corresponding input map at coordinates $(x, y)$ such that:

$$o_x + k\tilde{x} \le x \le o_x + k\tilde{x} + k - 1,$$
$$o_y + k\tilde{y} \le y \le o_y + k\tilde{y} + k - 1. \tag{5}$$

Then the size of the extended maps in $f$ is given by:

$$s_{x,l}^f = \mathrm{div}(s_{x,l-1}^{f_{in}} - o_x, k) \tag{6}$$
$$s_{y,l}^f = \mathrm{div}(s_{y,l-1}^{f_{in}} - o_y, k), \tag{7}$$

where div denotes the integer division operation. The max-pooling operation thus ignores the
following parts of the input extended maps:

• $o_x$ leftmost columns;

• $o_y$ top rows;

• $\mathrm{mod}(s_{x,l-1}^{f_{in}} - o_x, k)$ rightmost columns;

• $\mathrm{mod}(s_{y,l-1}^{f_{in}} - o_y, k)$ bottom rows.
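The fragmentation step can be sketched in code as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it handles a single extended map stored as a list of rows, and the name `maxpool_fragments` is hypothetical. For each offset $(o_x, o_y)$ it pools $k \times k$ blocks starting at that offset, with output sizes obtained by integer division as in equations (6) and (7), so the leading $o_x$ columns and $o_y$ rows and any incomplete trailing blocks are ignored.

```python
from itertools import product


def maxpool_fragments(ext_map, k):
    """Split one extended map into k*k max-pooled fragments.

    Returns a dict mapping each offset (ox, oy) to its pooled map
    (a list of rows). ox is the horizontal offset, oy the vertical one.
    """
    height, width = len(ext_map), len(ext_map[0])
    fragments = {}
    for oy, ox in product(range(k), range(k)):
        out_w = (width - ox) // k     # eq. (6): div(s_x - o_x, k)
        out_h = (height - oy) // k    # eq. (7): div(s_y - o_y, k)
        fragments[(ox, oy)] = [
            [max(ext_map[oy + k * yo + j][ox + k * xo + i]
                 for j in range(k) for i in range(k))
             for xo in range(out_w)]
            for yo in range(out_h)]
    return fragments


# Toy 5 x 5 extended map whose pixel value encodes its position (x + 10 * y).
ext = [[x + 10 * y for x in range(5)] for y in range(5)]
frags = maxpool_fragments(ext, 2)
for offset in sorted(frags):
    print(offset, frags[offset])
```

With $k = 2$ the toy map yields four 2 x 2 fragments, one per offset in $\mathbf{o}$; for example, offset (0, 0) gives [[11, 13], [31, 33]] and offset (1, 1) gives [[22, 24], [42, 44]].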



3 Discussion and Results 

Convolutional layers do not alter the number of fragments (and operate on each fragment
independently), whereas each max-pooling layer produces $k^2$ times the number of fragments given at its
input. Therefore, the final number of fragments generated by a net is equal to the product of the
squares of the kernel sizes of all its max-pooling layers; for example, the net in Table 1 produces
$2^2 \cdot 2^2 \cdot 2^2 \cdot 2^2 = 256$ fragments at the output of layer 8. Note that for all layer types, including
fully-connected layers, data in a fragment at layer $l$ only depends on data in a single fragment at
layer $l - 1$.
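The fragment count is a simple product, shown here as a small sketch (the helper name `num_fragments` is an assumption for illustration):

```python
from math import prod


def num_fragments(pool_sizes):
    """Number of fragments after max-pooling layers with the given kernel sizes."""
    return prod(k * k for k in pool_sizes)


# Four 2x2 max-pooling layers, as in the net of Table 1.
print(num_fragments([2, 2, 2, 2]))  # -> 256
```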

Let $w_l$ be the size of the map at layer $l$ when using the patch-based approach. Now consider
our image-based approach, and the set $\mathbf{I}_l^f$ of extended maps at a given fragment $f$ for layer $l$.
Any $w_l \times w_l$ subimage cropped from such extended maps corresponds to the contents of the
corresponding maps for some patch contained in the original image. A single fragment contains 
data for a subset of the patches contained in the original image. Collectively, all fragments at a 
given layer contain data for the whole set of all patches contained in the original image. 
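The correspondence between patches and fragment crops can be verified on a toy net with one convolutional layer followed by one max-pooling layer. The sketch below is illustrative only and makes several assumptions: a single map, integer data, hypothetical helper names, and the indexing rule that the patch with top-left corner $(p_x, p_y)$ is found in the fragment with offset $(p_x \bmod k, p_y \bmod k)$ at crop position $(\lfloor p_x / k \rfloor, \lfloor p_y / k \rfloor)$. That rule follows from the pooling definition for this two-layer case; it is not stated in this form in the text.

```python
import random


def conv_valid(m, kern):
    """Valid 2D correlation of a list-of-rows map with a square kernel."""
    k = len(kern)
    return [[sum(m[y + j][x + i] * kern[j][i] for j in range(k) for i in range(k))
             for x in range(len(m[0]) - k + 1)]
            for y in range(len(m) - k + 1)]


def maxpool(m, k, ox=0, oy=0):
    """k x k max-pooling starting at offset (ox, oy); incomplete blocks are ignored."""
    out_w, out_h = (len(m[0]) - ox) // k, (len(m) - oy) // k
    return [[max(m[oy + k * y + j][ox + k * x + i] for j in range(k) for i in range(k))
             for x in range(out_w)]
            for y in range(out_h)]


def crop(m, x0, y0, size):
    return [row[x0:x0 + size] for row in m[y0:y0 + size]]


random.seed(1)
s, w0, kc, kp = 9, 5, 2, 2   # image size, patch size, conv kernel, pooling kernel
image = [[random.randint(0, 9) for _ in range(s)] for _ in range(s)]
kern = [[random.randint(-3, 3) for _ in range(kc)] for _ in range(kc)]
w2 = (w0 - (kc - 1)) // kp   # patch-level map size after conv + pool

# Image-based pass: one convolution, then one fragment per pooling offset.
ext = conv_valid(image, kern)
fragments = {(ox, oy): maxpool(ext, kp, ox, oy)
             for oy in range(kp) for ox in range(kp)}

# Patch-based pass, compared against a crop of the matching fragment.
mismatches = 0
for py in range(s - w0 + 1):
    for px in range(s - w0 + 1):
        patch_out = maxpool(conv_valid(crop(image, px, py, w0), kern), kp)
        frag = fragments[(px % kp, py % kp)]
        if patch_out != crop(frag, px // kp, py // kp, w2):
            mismatches += 1
print(len(fragments), (s - w0 + 1) ** 2, mismatches)
```

Here 4 fragments cover all 25 patches of the 9 x 9 image, and a mismatch count of zero indicates that the image-based pass reproduces the patch-based outputs exactly on this example.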
Table 1: 11-layer architecture for network N4 used in [16].

Layer (l)   Type              Maps (|P_l|) and neurons    Kernel (k_l x k_l)
0           input             1 map of 95x95 neurons
1           convolutional     48 maps of 92x92 neurons    4x4
2           max pooling       48 maps of 46x46 neurons    2x2
3           convolutional     48 maps of 42x42 neurons    5x5
4           max pooling       48 maps of 21x21 neurons    2x2
5           convolutional     48 maps of 18x18 neurons    4x4
6           max pooling       48 maps of 9x9 neurons      2x2
7           convolutional     48 maps of 6x6 neurons      4x4
8           max pooling       48 maps of 3x3 neurons      2x2
9           fully connected   200 neurons                 1x1
10          fully connected   2 neurons                   1x1



Table 2: Theoretically required FLOPS for convolutional layers when segmenting a 512 x 512 image
using patch-based (FLOPS_l^patch) and image-based (FLOPS_l^image) approaches. Net architecture in
Table 1. See text for details.

Layer (l)   s     s_{l-1}   |P_{l-1}|   |P_l|   w_l   k_l   F_l   FLOPS_l^patch [10^9]   FLOPS_l^image [10^9]   speedup
1           512   559       1           48      92    4     1     3408                   0.5                    7114.8
3           512   279       48          48      42    5     4     53271                  35.9                   1485.1
5           512   139       48          48      18    4     16    6262                   22.8                   274.7
7           512   69        48          48      6     4     64    695                    22.5                   30.9
Total                                                             63636                  81.6                   779.8



3.1 Theoretical speedup 

We now discuss the speedup of our image-based approach in comparison to separate evaluation of 
all patches contained in the input image. We consider as an example the largest network (Table 1)
used in [16] for neuronal membrane segmentation [20]. The image size (one slice with neuronal
tissue data) is 512 x 512 pixels (see Figure 2). Its edges are mirrored, to get enough pixels for
applying the network to all positions. We limit our analysis to convolutional layers, which are by 
far the most computationally intensive part of a DNN. Conversely, max-pooling layers are simple 
and fast, requiring less than 1% of the computing time in most practical DNN. 

For the patch-based approach, the required number of floating-point operations (FLOPS) for
computing the convolutions in layer $l$ when scanning an image by a DNN obeys the following
formula:

$$\mathrm{FLOPS}_l^{patch} = s^2 \cdot |\mathbf{P}_{l-1}| \cdot |\mathbf{P}_l| \cdot w_l^2 \cdot k_l^2 \cdot 2,$$

where $s^2$ is the number of pixels in the input image, and, for each convolutional layer $l$, $|\mathbf{P}_l|$ denotes
the number of maps, $w_l^2$ the number of pixels of the map, and $k_l^2$ the number of kernel pixels.
The factor "2" reflects that we have one addition and one multiplication for each component of
the dot product.

For the image-based approach, the FLOPS can be computed using the following formula:

$$\mathrm{FLOPS}_l^{image} = s_{x,l-1} \cdot s_{y,l-1} \cdot |\mathbf{P}_{l-1}| \cdot |\mathbf{P}_l| \cdot F_l \cdot k_l^2 \cdot 2,$$

where $s_{x,l} \cdot s_{y,l}$ represents the size of a fragment in layer $l$ (to simplify the computation, we assume
all fragments have the same size, although they may differ in size by at most one pixel). For the
input layer ($l = 0$), mirroring the borders implies that $s_{x,0} = s_{y,0} = s + (w_0 - 1)/2$.

Table 2 reports such computations for all convolutional layers in the network of Table 1. The
patch-based approach requires 779.8 times more FLOPS than the image-based approach.
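The figures in Table 2 can be reproduced with a short script. This is a sketch under the same simplifying assumption as the text (square fragments of equal side at the input of each convolutional layer); the function names `flops_patch` and `flops_image` and the row encoding are assumptions made for illustration.

```python
def flops_patch(s, p_in, p_out, w, k):
    """Patch-based cost of one convolutional layer:
    s^2 * |P_{l-1}| * |P_l| * w_l^2 * k_l^2 * 2."""
    return s**2 * p_in * p_out * w**2 * k**2 * 2


def flops_image(s_in, p_in, p_out, n_frag, k):
    """Image-based cost, assuming square fragments of side s_in at the layer input:
    s_in^2 * |P_{l-1}| * |P_l| * F_l * k_l^2 * 2."""
    return s_in**2 * p_in * p_out * n_frag * k**2 * 2


# Rows of Table 2: (layer, s, s_{l-1}, |P_{l-1}|, |P_l|, w_l, k_l, F_l)
ROWS = [(1, 512, 559, 1, 48, 92, 4, 1),
        (3, 512, 279, 48, 48, 42, 5, 4),
        (5, 512, 139, 48, 48, 18, 4, 16),
        (7, 512, 69, 48, 48, 6, 4, 64)]

total_patch = total_image = 0
for layer, s, s_in, p_in, p_out, w, k, n_frag in ROWS:
    fp = flops_patch(s, p_in, p_out, w, k)
    fi = flops_image(s_in, p_in, p_out, n_frag, k)
    total_patch += fp
    total_image += fi
    print(f"layer {layer}: patch {fp / 1e9:.1f}e9, image {fi / 1e9:.1f}e9, "
          f"speedup {fp / fi:.1f}")
print(f"total: patch {total_patch / 1e9:.1f}e9, image {total_image / 1e9:.1f}e9, "
      f"speedup {total_patch / total_image:.1f}")
```

The printed values agree with Table 2 up to rounding of the last digit, including the overall ratio of about 779.8.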

3.2 Experimental speedup 

In Table 3 we report computation times of the DNN in Table 1 when used to segment a 512 x 512
image using three different implementations:

matlab-patch: a plain MATLAB implementation of the patch-based approach;

GPU-patch: a heavily optimized implementation of the patch-based approach running on a GTX 580
graphics card using CUDA;

matlab-image: a plain MATLAB implementation of the image-based approach.



Table 3: Speed for segmenting a 512 x 512 image using the net in Table 1.

Method         Time per image [s]   Speedup relative to GPU-patch
matlab-patch   24641.54
GPU-patch      492.83               1
matlab-image   15.05                32.8



Results clearly show that the image-based implementation yields a dramatic speedup over 
patch-based approaches. In particular, matlab-image yields a 32-fold speedup when compared 
to the highly-optimized GPU-patch implementation, despite the former being implemented in a 
slower environment and without attention to low-level optimizations. The impact of GPU and 
low-level optimizations is obvious as the GPU-patch approach is 50 times faster than matlab-patch. 

4 Conclusions 

We greatly sped up forward-propagating deep neural networks on sliding windows. Our approach 
handles the complications due to max-pooling layers interleaved with convolutional layers, avoiding 
unnecessary computations. This is important for fast object detection and image segmentation. 
For huge nets such as those winning the ISBI Electron Microscopy Segmentation Challenge ^nil?!]
(see Figure 2), our approach is in theory almost three orders of magnitude faster than a
straightforward patch-based forward-propagation approach. In practice, a simple MATLAB implementation
yields a 32-fold speedup over a highly optimized patch-based GPU implementation. 




Figure 2: An input electron microscopy slice (left) and the corresponding segmentation output
(right). Data from the ISBI EM segmentation challenge [3D]. As our approach is an exact method,
patch-based and image-based approaches yield identical results.